rfcs: graph: support int4/int8 compression for K/V in fused SDPA #2041

Open: wants to merge 1 commit into base `rfcs` from `zhitao/rfc/graph-int4-support`
Conversation

@wzt1997 (Contributor) commented Aug 16, 2024:

Description

This proposes supporting int4/int8 compression for K/V in fused SDPA.
Link to the rendered document.

@wzt1997 wzt1997 added the RFC A design document label Aug 16, 2024
@wzt1997 wzt1997 self-assigned this Aug 16, 2024
@wzt1997 force-pushed the zhitao/rfc/graph-int4-support branch from 943a6b3 to 83217a2 on August 16, 2024 06:57
@vpirogov vpirogov changed the title rfcs: graph: support int4/int8 compression for K/V in fuesd SDPA rfcs: graph: support int4/int8 compression for K/V in fused SDPA Aug 16, 2024
inputs should be `1d` tensors.
2. For `per_group` quantization, all dimensions should match the input,
except for the dimension where grouped quantization applies, which should
be `src_dim / group_size`.
A contributor commented:

Not sure how that matches the case where the batch dimension should be broadcast for scales/zero-points:

```
W = [B, K, N]
pre-scale W: [1, gK, N] x [B, K, N] = W'
matmul: [B, M, K] x W' = [B, M, N]
```

The use cases I've seen are like that: the batch dimension doesn't receive its own dimension of scales.
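To make the broadcast concrete, here is a minimal plain-C++ sketch of the dequantization arithmetic described above, assuming row-major `[B, K, N]` weights, grouping along K with group size `G` (so `gK = K / G`), and a scale tensor of shape `[1, gK, N]` shared across the batch. All names here are illustrative, not part of the oneDNN API:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Dequantize a [B, K, N] int8 weight with scales of shape [1, K/G, N]:
// the scale for element (b, k, n) is scales[(k / G) * N + n], i.e. the
// batch dimension is broadcast and K is grouped with group size G.
// (int4 would work the same way after unpacking to int8.)
std::vector<float> dequantize_grouped(const std::vector<int8_t> &w,
        const std::vector<float> &scales, size_t B, size_t K, size_t N,
        size_t G) {
    std::vector<float> out(B * K * N);
    for (size_t b = 0; b < B; ++b)
        for (size_t k = 0; k < K; ++k)
            for (size_t n = 0; n < N; ++n) {
                const size_t idx = (b * K + k) * N + n;
                out[idx] = w[idx] * scales[(k / G) * N + n];
            }
    return out;
}
```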

@wzt1997 (Contributor, author) replied on Aug 27, 2024:

Thanks for providing the case. Based on a potential request from IPEX and on per-token quantization, we decided to update the scale/zp input shape requirement for per-group quantization for scalability and flexibility. One potential option is to allow 1 on dimensions other than the last two (K and N).
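As a hypothetical illustration of that relaxed rule (not the final spec), a shape check might allow broadcastable 1s on all but the last two dimensions:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical check of a per-group scale/zp shape against the input shape:
// leading dims may either match or broadcast (be 1); the last two dims must
// match, except the grouped one, which must equal src_dim / group_size.
bool scale_shape_ok(const std::vector<int64_t> &src,
        const std::vector<int64_t> &scale, int64_t group_size,
        size_t grouped_dim /* index of the grouped dimension */) {
    if (src.size() != scale.size()) return false;
    for (size_t i = 0; i < src.size(); ++i) {
        const bool leading = i + 2 < src.size();
        if (i == grouped_dim) {
            if (scale[i] != src[i] / group_size) return false;
        } else if (scale[i] != src[i] && !(leading && scale[i] == 1)) {
            return false;
        }
    }
    return true;
}
```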

(Three outdated review threads on rfcs/20240808-graph-api-int-compression-for-sdpa/README.md were resolved.)
@wzt1997 force-pushed the zhitao/rfc/graph-int4-support branch from 1d5e1d2 to d633cf7 on August 27, 2024 08:31
@wzt1997 requested a review from a team as a code owner on August 27, 2024 08:31
@wzt1997 force-pushed the zhitao/rfc/graph-int4-support branch 4 times, most recently from 04a94db to 5b84ad0 on August 29, 2024 00:28
```c
/// 4-bit signed integer.
dnnl_s4 = 11,
/// 4-bit unsigned integer.
dnnl_u4 = 12,
```
A contributor commented:

For API completeness, I feel we do need to add both s4 and u4. Just out of curiosity, do you know whether both s4 and u4 are used for this int4 K/V compression request? If both are used, is there any difference in the quantization recipe between s4 and u4?

@wzt1997 (Contributor, author) replied:

According to the user request, they are likely to use s4 for K/V storage, but as the K/V compression work is still in progress, this may change. Regarding the difference between u4 and s4 recipes: since int4 data types always use asymmetric quantization, the parameters will be quite similar for u4 and s4; the difference is mainly in the dequantization logic.
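As a side note on that dequantization logic: the asymmetric formula is the same for both types; only the stored value range differs (s4 in [-8, 7], u4 in [0, 15]), and with it where the zero point lands. A sketch with illustrative names, not oneDNN API:

```cpp
// Asymmetric dequantization: x = (q - zp) * scale.
// For s4 the quantized value q lies in [-8, 7]; for u4 it lies in [0, 15].
// The recipe (how scale/zp are computed) is analogous for both; only the
// value range, and hence the zero-point placement, differs.
inline float dequant_int4(int q, int zp, float scale) {
    return static_cast<float>(q - zp) * scale;
}
```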

@wzt1997 force-pushed the zhitao/rfc/graph-int4-support branch from 5b84ad0 to 6f68c51 on September 2, 2024 02:49
@wzt1997 force-pushed the zhitao/rfc/graph-int4-support branch from 6f68c51 to f1c5003 on September 6, 2024 09:06
@vpirogov added this to the v3.7 milestone on Sep 9, 2024
@wzt1997 force-pushed the zhitao/rfc/graph-int4-support branch 2 times, most recently from c6ed1d1 to f1c0722 on September 23, 2024 02:43
@wzt1997 force-pushed the zhitao/rfc/graph-int4-support branch from f1c0722 to a8a3a20 on October 8, 2024 07:53
@wzt1997 force-pushed the zhitao/rfc/graph-int4-support branch 2 times, most recently from 4aae2bf to 0ac893e on October 30, 2024 07:51
@wzt1997 force-pushed the zhitao/rfc/graph-int4-support branch from 0ac893e to 810ef3e on November 1, 2024 06:03
@dzarukin (Contributor) left a comment:

Thanks for updating the document. I left some minor comments.

Comment on lines +140 to +141
are required for each hidden dimension to maintain the model accuracy. The
requirement for grouped zero points is not promised. See [int4
A contributor commented:

> The requirement for grouped zero points is not promised.

Could you please elaborate on the meaning of this phrase and how it affects the proposal?


1. Add `per_group` to the supported values of `qtype` attribute, and the default
value will be unchanged.
2. The existing attribute `axis` will ignored if `per_group` quantization is
A contributor commented:

Suggested change:
```diff
- 2. The existing attribute `axis` will ignored if `per_group` quantization is
+ 2. The existing attribute `axis` will be ignored if `per_group` quantization is
```

each quantization group (which is `group_size`), and the other dimensions
should all be `1`. The attribute is required when the `per_group` quantization
type is specified for `qtype`. If `per_group` quantization is not specified
and the `group_shape` attribute is given, it will be ignored.
A contributor commented:

It seems it's implicitly assumed that the number of dimensions in the `group_shape` attribute coincides with the number of dimensions of the input shape. It would be good to spell that out explicitly if that's the case.
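Assuming that the dimension counts do coincide, as the comment suggests spelling out, the expected scale/zero-point shape could be derived by elementwise division (a sketch, not the RFC's normative wording):

```cpp
#include <cstdint>
#include <vector>

// Derive the expected scale/zero-point shape from the input shape and the
// proposed `group_shape` attribute, assuming both have the same number of
// dimensions (group_size on the grouped dimension, 1 elsewhere).
std::vector<int64_t> expected_scale_shape(const std::vector<int64_t> &src,
        const std::vector<int64_t> &group_shape) {
    std::vector<int64_t> out(src.size());
    for (size_t i = 0; i < src.size(); ++i)
        out[i] = src[i] / group_shape[i]; // e.g. {K, N} / {1, G} -> {K, N/G}
    return out;
}
```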

```cpp
const size_t nG = 2, G = N / nG;

dims src_dims = {K, N};
dims scale_dims = {K, N/G};
```
A contributor commented:

Suggested change:
```diff
- dims scale_dims = {K, N/G};
+ dims scale_dims = {K, nG};
```

```cpp
// Specify the quantization type as per_group quantization.
deq.set_attr<std::string>(op::attr::qtype, "per_group");
// Axis indicates on which dimension the quantization will be applied.
deq.set_attr<int64_t>(op::attr::axis, 1);
```
A contributor commented:

Looks like this is not needed any longer.
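Assuming the `group_shape` attribute proposed earlier in this RFC, the snippet would then reduce to something like the following (a sketch; the exact attribute spelling and value type are per the final API):

```cpp
// Specify the quantization type as per_group quantization; the grouping is
// fully described by group_shape, so no separate axis attribute is needed.
deq.set_attr<std::string>(op::attr::qtype, "per_group");
deq.set_attr<std::vector<int64_t>>(op::attr::group_shape, {1, G});
```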

```cpp
// ...
```

##### Limitation
A contributor commented:

Suggested change:
```diff
- ##### Limitation
+ #### Limitation
```

Comment on lines +243 to +246
Similar to the discussion of int8 and fp8 data types, an alternative solution
may be supporting int4 data types directly in computation operations like MatMul
and Convolution. But it will bloat and complicate the opset and op schema of
oneDNN Graph. Hence it's not considered here.
A contributor commented:

I feel this paragraph is not quite relevant, as the path for integer data types was already paved and it won't change.

I'd rather replace it with an explanation of why a vector was chosen over the single value + axis combination used initially. That is more valuable information for future reference.

Comment on lines +248 to +253
In addition, to enhance clarity and reduce compilation overhead, oneDNN Graph
will support the direct quantization between int4 data types and bf16/f16.
This means that additional `Typecast` operations will no longer be necessary
between quantization processes and subsequent operations in bf16/f16 computation
graphs. Although it will complicate the fusion patterns to some extent, we can
address this by implementing more precise pattern definitions.
A contributor commented:

It seems to me that this paragraph belongs in a Proposal 3 section instead, since it already has TypeCast dropped from the picture.
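For reference, a sketch of the two graph shapes the quoted paragraph contrasts, using op names from the oneDNN Graph opset (the exact fused pattern is whatever the RFC finally specifies):

```cpp
// Without direct int4 <-> f16 quantization, a cast is needed in between:
//   DynamicDequantize(s4 -> f32) -> TypeCast(f32 -> f16) -> MatMul(f16, ...)
// With this proposal, the dequantize produces f16 directly:
//   DynamicDequantize(s4 -> f16) -> MatMul(f16, ...)
```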
